
Project Name -

Project Type - EDA/AirBnb Booking Analysis
Contribution - Individual
Name - Rohan Dutta

Project Summary -

The analysis of the AirBnb dataset has revealed valuable insights into various aspects of listings in New York City. The majority of the listings are in the "Entire home/apt" and "Private room" categories, indicating a preference for privacy and independence. Manhattan and Brooklyn are the most popular neighbourhood groups, with higher prices and longer minimum stay requirements than other areas. The analysis also highlights the importance of pricing and availability, which vary by room type and neighbourhood. Geographical analysis identified specific regions with higher prices and longer minimum stay requirements, providing guidance for selecting suitable accommodations.

Based on these insights, several solutions have been proposed to address the business problems. These include refining pricing strategies, promoting underrepresented neighbourhoods, enhancing search filters and the user interface, and improving the guest experience. The findings also emphasize the significance of data-driven decision making and collaboration with local authorities to optimize operations and comply with regulations. By leveraging these insights, AirBnb can enhance the user experience, attract more hosts and guests, and improve overall business performance in the competitive short-term rental market in New York City.

GitHub Link -

Problem Statement

Performing Exploratory Data Analysis (EDA) on the AirBnb dataset poses several challenges due to its size: the NYC 2019 file used here contains 48,895 listings across 16 columns. Effective strategies are required to process and analyze this dataset and to extract valuable insights and patterns. The goal is to uncover trends, patterns and key features of the data to enhance decision-making and improve the overall AirBnb experience for hosts and users.


Let's Begin !

1. Know Your Data

Import Libraries

# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import missingno as msn
import folium
import warnings

from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive

Dataset Loading

# Load Dataset
df = pd.read_csv('/content/Airbnb NYC 2019.csv')

Dataset First View

# Dataset First Look
df.head().style.background_gradient(cmap='cool')

Dataset Rows & Columns count

# Dataset Rows & Columns count
num_rows = df.shape[0]
num_columns = df.shape[1]

print("Dataset Rows count:", num_rows)
print("Dataset Columns count:", num_columns)
Dataset Rows count: 48895
Dataset Columns count: 16

Dataset Information

# Dataset Info
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48895 entries, 0 to 48894
Data columns (total 16 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              48895 non-null  int64  
 1   name                            48879 non-null  object 
 2   host_id                         48895 non-null  int64  
 3   host_name                       48874 non-null  object 
 4   neighbourhood_group             48895 non-null  object 
 5   neighbourhood                   48895 non-null  object 
 6   latitude                        48895 non-null  float64
 7   longitude                       48895 non-null  float64
 8   room_type                       48895 non-null  object 
 9   price                           48895 non-null  int64  
 10  minimum_nights                  48895 non-null  int64  
 11  number_of_reviews               48895 non-null  int64  
 12  last_review                     38843 non-null  object 
 13  reviews_per_month               38843 non-null  float64
 14  calculated_host_listings_count  48895 non-null  int64  
 15  availability_365                48895 non-null  int64  
dtypes: float64(3), int64(7), object(6)
memory usage: 6.0+ MB

Duplicate Values

# Dataset Duplicate Value Count
d = df.duplicated().sum()
print(f'Dataset Duplicate Value Count is {d}')
Dataset Duplicate Value Count is 0

Missing Values/Null Values

# Missing Values/Null Values Count
df.isnull().sum().sort_values(ascending = False)
last_review                       10052
reviews_per_month                 10052
host_name                            21
name                                 16
id                                    0
host_id                               0
neighbourhood_group                   0
neighbourhood                         0
latitude                              0
longitude                             0
room_type                             0
price                                 0
minimum_nights                        0
number_of_reviews                     0
calculated_host_listings_count        0
availability_365                      0
dtype: int64

# Visualizing the missing values
msn.bar(df,color='cyan')

What did you know about your dataset?

Duplicate Values: The dataset does not contain any duplicate rows, so each row represents a unique record.

Missing Values: Four columns have missing values: 'name' (16), 'host_name' (21), 'last_review' (10,052) and 'reviews_per_month' (10,052), as shown by non-null counts below the total of 48,895 rows. 'last_review' and 'reviews_per_month' are missing in the same number of rows, which would be consistent with listings that have no reviews yet. These missing values may need to be handled appropriately depending on the specific analysis or use case.

Data Types: The dataset contains a mix of data types: integer (7 columns), float (3 columns) and object, i.e. string (6 columns). The data types provide information about the nature of the variables and can help determine appropriate statistical or analytical techniques for further analysis.
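
As a rough way to quantify the missing values described above, the sketch below reports the count and percentage of nulls per column. The frame `toy` is a hypothetical stand-in with made-up rows, not the real dataset; in the notebook, `df` would take its place.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Airbnb frame (assumption: made-up rows)
toy = pd.DataFrame({
    "name": ["Loft", None, "Studio", "Room"],
    "last_review": ["2019-05-01", None, None, "2019-06-10"],
    "reviews_per_month": [0.5, np.nan, np.nan, 1.2],
    "price": [150, 80, 200, 60],
})

# Null count and null share per column, worst columns first
missing = pd.DataFrame({
    "missing_count": toy.isnull().sum(),
    "missing_pct": (toy.isnull().mean() * 100).round(1),
}).sort_values("missing_count", ascending=False)

print(missing)
```

Seeing the share next to the count makes it easier to judge whether dropping rows or filling values is the less costly option for a given column.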

2. Understanding Your Variables

# Dataset Columns
df.columns
Index(['id', 'name', 'host_id', 'host_name', 'neighbourhood_group',
       'neighbourhood', 'latitude', 'longitude', 'room_type', 'price',
       'minimum_nights', 'number_of_reviews', 'last_review',
       'reviews_per_month', 'calculated_host_listings_count',
       'availability_365'],
      dtype='object')

# Dataset Describe
df.describe()

Variables Description

  • id: listing ID (int64)

  • name: name of the listing (object)

  • host_id: host ID (int64)

  • host_name: name of the host (object)

  • neighbourhood_group: borough-level area (object)

  • neighbourhood: neighbourhood name (object)

  • latitude: latitude coordinate (float64)

  • longitude: longitude coordinate (float64)

  • room_type: listing space type (object)

  • price: price in dollars (int64)

  • minimum_nights: minimum number of nights per booking (int64)

  • number_of_reviews: number of reviews (int64)

  • last_review: date of the latest review (object)

  • reviews_per_month: number of reviews per month (float64)

  • calculated_host_listings_count: number of listings per host (int64)

  • availability_365: number of days in the year when the listing is available (int64)


Check Unique Values for each variable.

# Check Unique Values for each variable.
def unique_values(x):
  return df[x].unique()

# Columns left out of the printout
skip_columns = ['neighbourhood', 'reviews_per_month', 'price']

for i in df:
  if i in skip_columns:
    continue
  # Print inside the loop so every remaining column is reported, not only the last one
  print('-'*50)
  print()
  print('unique values of', i)
  print(unique_values(i))
  print('-'*50)
  print('-'*50)
--------------------------------------------------

unique values of availability_365
[365 355 194   0 129 220 188   6  39 314 333  46 321  12  21 249 347 364
 304 233  85  75 311  67 255 284 359 269 340  22  96 345 273 309  95 215
 265 192 251 302 140 234 257  30 301 294 320 154 263 180 231 297 292 191
  72 362 336 116  88 224 322 324 132 295 238 209 328  38   7 272  26 288
 317 207 185 158   9 198 219 342 312 243 152 137 222 346 208 279 250 164
 298 260 107 199 299  20 318 216 245 189 307 310 213 278  16 178 275 163
  34 280   1 170 214 248 262 339  10 290 230  53 126   3  37 353 177 246
 225  18 343 326 162 240 363 247 323 125  91 286  60  58 351 201 232 258
 341 244 329 253 348   2  56  68 360  76  15 226 349  11 316 281 287  14
  86 261 331  51 254 103  42 325  35 203   5 276 102  71  78   8 182  79
  49 156 200 106 135  81 142 179  52 237 204 181 296 335 282 274  98 157
 174 223 361 283 315  36 271 139 193 136 277 221 264 236  89  23 218 235
 119 350 161 259  27 167 358  59 337  43  25 127 303 115 268  44  65 252
  64 111  90 338  31 241 285 183  84 166  28  83 305 356 308 229 210 153
 332 120 313  69 293   4 300  40 117 206 144 354  41 270 306  33  50  80
  97 118 134  17 289 121 205  74  62  29 109 168 146 242 352 155 291 266
 101 190 327 217 171 110  87 202  70 147 169 212 122 330  54 196  57  73
 149 239  63 195  47 319  19 112 344  77 160 141  13  24 150 128 176 357
 211 172 256 165  32 105 267 148  93  45 175 159  48 100 184 114 133 186
 334  94 151 228 113  55  66 173 104 197  99 131 143 124 130 187 145 108
 123  92  61 138 227  82]
--------------------------------------------------
--------------------------------------------------
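
Printing every unique value gets long for high-cardinality columns. A more compact view, sketched here on a hypothetical toy frame (`toy` and its rows are assumptions, not the real data), is the number of distinct values per column.

```python
import pandas as pd

# Hypothetical toy rows (assumption: stand-in for the real dataset)
toy = pd.DataFrame({
    "room_type": ["Private room", "Entire home/apt", "Private room"],
    "neighbourhood_group": ["Brooklyn", "Manhattan", "Brooklyn"],
    "price": [149, 225, 150],
})

# Number of distinct values per column, lowest cardinality first
summary = toy.nunique().sort_values()
print(summary)
```

Columns with only a handful of distinct values are natural candidates for grouping and count plots.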

3. Data Wrangling

Data Wrangling Code

# Write your code to make your dataset analysis ready.
df.dropna(inplace=True)  # drop every row that has at least one NaN/null value
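
Dropping every row with a null also discards listings whose only gap is in the review columns. One possible alternative, sketched on hypothetical toy rows, is to fill the review rate with zero and use a placeholder for missing text. Whether that suits the analysis is an assumption, and the rest of this notebook keeps the `dropna` approach above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows (assumption: stand-in for the real dataset)
toy = pd.DataFrame({
    "name": ["Loft", None, "Studio"],
    "host_name": ["Ann", "Bob", None],
    "number_of_reviews": [12, 0, 0],
    "last_review": ["2019-05-01", None, None],
    "reviews_per_month": [0.5, np.nan, np.nan],
})

cleaned = toy.copy()
# Treat a missing review rate as zero reviews per month
cleaned["reviews_per_month"] = cleaned["reviews_per_month"].fillna(0)
# Parse dates; missing dates become NaT instead of forcing a row drop
cleaned["last_review"] = pd.to_datetime(cleaned["last_review"])
# Placeholder for missing text fields
cleaned[["name", "host_name"]] = cleaned[["name", "host_name"]].fillna("Unknown")

print(len(toy.dropna()), "row(s) survive dropna;", len(cleaned), "rows survive filling")
```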

4. Data Visualization, Storytelling & Experimenting with charts: Understand the relationships between variables

Chart - 1

# Chart - 1 visualization code
# Distribution chart of dataset
# df.hist draws one subplot per numeric column, so pass the figure size to it directly
axes = df.hist(figsize=(15, 10), color='green')
# Label every subplot (plt.xlabel/plt.ylabel would only affect the last one)
for ax in axes.ravel():
  ax.set_xlabel('Value', fontsize=12)
  ax.set_ylabel('Frequency', fontsize=12)
# One overall title for the whole grid
plt.suptitle('Histogram of df features', fontsize=15)
plt.tight_layout()
# display the figure
plt.show()

Chart - 2

# Chart - 2 visualization code
# Distribution by room type
# Setting chart size
plt.figure(figsize = (10,6))
# setting data for chart creation
room_type_counts = df['room_type'].value_counts()
# Setting additional parameter
explode = (0.05, 0., 0.)
# Creating the chart
plt.pie(room_type_counts, labels = room_type_counts.index, autopct='%1.1f%%', explode = explode)
plt.axis('equal')
# Setting the title
plt.title('Room Types')
# Display the chart
plt.show()

Observations

  • Dominant Categories: The majority of listings fall into two main categories, 'Entire home/apt' and 'Private room'. 'Entire home/apt' has the highest share at 52% of listings, followed by 'Private room' at 47.5%.

  • Limited Shared Rooms: 'Shared room' listings are relatively rare, at only 2.4% of listings. This indicates that shared accommodations are less common in the dataset.

  • Preference for Privacy: The higher counts of 'Entire home/apt' and 'Private room' suggest that guests tend to prefer accommodations that offer more privacy and independence.

These insights summarize the distribution of room types and highlight the preference for private and independent accommodations in the dataset.
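
The shares shown on a pie chart can be cross-checked numerically. The sketch below uses `value_counts(normalize=True)` on a hypothetical toy column; the counts are made up and do not reproduce the percentages quoted above.

```python
import pandas as pd

# Hypothetical toy listings (assumption: not the real counts)
toy = pd.DataFrame({
    "room_type": ["Entire home/apt"] * 5 + ["Private room"] * 4 + ["Shared room"] * 1
})

# normalize=True returns fractions that sum to 1; scale them to percentages
shares = (toy["room_type"].value_counts(normalize=True) * 100).round(1)
print(shares)
```

Running the same two lines on the notebook's `df` gives the exact percentages behind Chart 2, which should add up to roughly 100 after rounding.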

Chart - 3

# Chart - 3 visualization code
# Countplot of Neighbourhood group
# Set the figure size
plt.figure(figsize= (15,3))
# Creating chart
ax = sns.countplot(data = df, x = 'neighbourhood_group')
# Setting labels
ax.set_xlabel('Neighbourhood Group')
ax.set_ylabel('Count')
# Displaying the chart
plt.show()

Observations

  • Manhattan and Brooklyn are the most represented neighbourhood groups, with 21,652 and 20,098 listings, respectively.

  • Queens, Staten Island and Bronx have fewer listings, with 5,666, 373 and 1,090 listings, respectively.

  • Manhattan and Brooklyn are popular choices for AirBnb listings, potentially due to their attractions and demand for short-term rentals.

These insights highlight the dominance of Manhattan and Brooklyn in the dataset, the relatively lower representation of Queens, Bronx and Staten Island, and the popularity of accommodations in Manhattan and Brooklyn for AirBnb listings in New York City.


Chart - 4

# Chart - 4 visualization code
# Histplot of Reviews per month
# Set figure size
plt.figure(figsize= (10,4))
# Creating the chart
sns.histplot(data=df, x = 'reviews_per_month',bins = 60)
plt.xlabel('reviews_per_month')
# histplot shows counts by default, so label the axis accordingly
plt.ylabel('Count')
plt.title('Reviews Per Month')
# Displaying the Chart
plt.show()

Observations

Most listings receive about 1 review per month, and about 90 percent of listings fall below 5 reviews per month.
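
A statement such as "90 percent fall below 5" can be checked directly with a boolean mean and a quantile. The series below is hypothetical illustrative data, so its numbers are not the notebook's results.

```python
import pandas as pd

# Hypothetical review rates (assumption: illustrative values only)
rates = pd.Series([0.1, 0.3, 0.5, 0.8, 1.0, 1.2, 2.0, 3.1, 4.5, 7.9],
                  name="reviews_per_month")

# Share of listings with fewer than 5 reviews per month
share_below_5 = (rates < 5).mean()
# Value below which 90% of the listings fall
p90 = rates.quantile(0.90)

print(f"share below 5: {share_below_5:.0%}, 90th percentile: {p90:.2f}")
```

On the real data, `df['reviews_per_month']` would replace `rates`.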

Chart - 5

# Chart - 5 visualization code
# Violin plot for distribution of price
# Setting chart size
plt.figure(figsize= (10,6))
# creating violin plot
sns.violinplot(data = df, x = 'price', color ='deepskyblue')
# Setting labels and other parameters
# (the width of the violin encodes density, so the y-axis needs no label)
plt.xlabel('Price')
plt.ylabel('')
plt.title('Price Distribution')
# Display the plot
plt.show()

# mean, mode and median of price
df['price'].mean(),df['price'].mode(),df['price'].median()
(142.33252621004095,
 0    150
 Name: price, dtype: int64,
 101.0)

Chart - 6

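# Chart - 6 visualization code
# Displot of the last review dates
# last_review is stored as strings, so parse it to datetime before plotting
last_review_dates = pd.to_datetime(df['last_review'])
# displot creates its own figure: size it with height/aspect instead of plt.figure
sns.displot(x=last_review_dates, bins=10, color='green', height=6, aspect=1.6)
# Setting labels and parameters (displot shows counts by default)
plt.xlabel('last_review')
plt.ylabel('Count')
plt.title('Last Review Distribution')
# Display the Chart
plt.show()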

Chart - 7

# Chart - 7 visualization code
# Bar plot for availability 365 and room type
# Setting chart size
plt.figure(figsize=(8,4))
# Creating charts
sns.barplot(x='availability_365', y= 'room_type', data = df)
# setting label and parameters
plt.xlabel('availability 365')
plt.ylabel('room type')
plt.title('Availability 365 for Room Type')
# Display the chart
plt.show()

Observations

  • Shared room availability: Shared rooms have a higher average availability (162 days) compared to the other room types.

  • Comparable availability: Entire home/apt and private rooms have similar average availability, around 110 days.

  • Room type impact: Room type influences availability, with shared rooms being more available.

  • Booking Consideration: Guests seeking shared rooms have more options throughout the year, while booking in advance may be necessary for entire home/apt and private rooms.

These insights highlight the difference in availability based on room type, with shared rooms being more readily available and advance booking more likely to be needed for entire home/apt and private rooms.
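
`sns.barplot` draws the mean of each group by default, so the bar heights can be reproduced with a `groupby`. The sketch below uses hypothetical toy values and adds the median and count for context.

```python
import pandas as pd

# Hypothetical toy rows (assumption: not the real availability figures)
toy = pd.DataFrame({
    "room_type": ["Shared room", "Shared room", "Private room",
                  "Private room", "Entire home/apt", "Entire home/apt"],
    "availability_365": [200, 120, 100, 120, 90, 130],
})

# Mean matches what the bar plot draws; median and count give extra context
availability = (toy.groupby("room_type")["availability_365"]
                   .agg(["mean", "median", "count"])
                   .sort_values("mean", ascending=False))
print(availability)
```

The count column matters here because a group with few listings, such as shared rooms, produces a less stable average.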

Chart - 8

# Chart - 8 visualization code
# Group by neighbourhood group and calculate the mean of price
grouped = df.groupby('neighbourhood_group')['price'].mean().reset_index()
# Sort the data by average 'price' in descending order
grouped = grouped.sort_values(by = 'price', ascending = False)
# create a bar plot to visualize the average 'price' by 'neighbourhood_group'
plt.figure(figsize=(10,3))
sns.barplot(data= grouped, x = 'price', y = 'neighbourhood_group', orient= 'h')
# call plt.title(...); assigning to plt.title would overwrite the function
plt.title('Average Price By Neighbourhood Group')
plt.xlabel('Average Price')
plt.ylabel('Neighbourhood Group')
# Display the Chart
plt.show()

Observations

  • Prices for AirBnb listings vary significantly across neighbourhood groups. Manhattan has the highest average price, followed by Staten Island, Brooklyn, Queens and Bronx.

  • Manhattan and Brooklyn are the most popular choices for AirBnb, with a large number of listings in both areas. Queens, Bronx and Staten Island have fewer listings compared to Manhattan and Brooklyn.

  • The bar plot visually shows the price differences, helping users make informed decisions about their accommodation based on their budget and preferences.

Chart - 9

# Chart - 9 visualization code
# pie chart on base of minimum nights and neighbourhood group
# group by the 'neighbourhood_group' and calculate the mean of 'minimum_nights'
explode = (0.05,0.05,0.05,0.05,0.05) #explode the slice by radius
# select the column before calling mean() so non-numeric columns are never averaged;
# .plot(figsize=...) creates its own figure, so no separate plt.figure call is needed
df.groupby('neighbourhood_group')['minimum_nights'].mean().plot(kind='pie', figsize=(8,6),startangle=90, autopct='%.3f',shadow=True, explode = explode)
plt.ylabel('')# just let this empty for sizing
# display the chart
plt.show()

Observations

  • The average minimum nights required for AirBnb listings varies across the neighbourhood groups in New York City. Manhattan has the highest average minimum nights, followed by Queens, Brooklyn, Staten Island and the Bronx.

  • Manhattan has the longest average minimum nights, indicating that listings in this area are geared towards longer stays. This could be due to the city's attractions and visitors' desire for a more immersive experience.

  • Staten Island has the second highest average minimum nights, suggesting that visitors to this neighbourhood group also tend to stay for a relatively longer duration compared to other areas.

  • Brooklyn and Queens have similar average minimum nights, indicating that visitors to these neighbourhood groups also opt for longer stays, though slightly shorter than those in Manhattan and Staten Island.

  • The Bronx has the lowest average minimum nights among all the neighbourhood groups. Visitors can consider factors like price, desired length of stay and attractions when selecting their accommodation in New York City.
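
A pie chart with `autopct` shows each slice's share of the total rather than the mean itself, so the ranking is easier to read from a sorted table. The sketch below uses hypothetical toy rows and is only meant to show the pattern.

```python
import pandas as pd

# Hypothetical toy rows (assumption: not the real minimum-night values)
toy = pd.DataFrame({
    "neighbourhood_group": ["Manhattan", "Manhattan", "Brooklyn",
                            "Brooklyn", "Queens", "Bronx"],
    "minimum_nights": [10, 6, 5, 7, 4, 3],
})

# Average minimum nights per group, highest first
ranking = (toy.groupby("neighbourhood_group")["minimum_nights"]
              .mean()
              .sort_values(ascending=False))
print(ranking)
```

Applied to the notebook's `df`, the same expression gives the exact order of the five neighbourhood groups.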


Chart - 10

# Chart - 10 visualization code
# Scatter plot on minimum nights by price 
plt.figure(figsize=(10,6))
sns.scatterplot(data=df, x='minimum_nights', y='price' )
plt.xlabel('Minimum Nights')
plt.ylabel('Price')
plt.show()

Observations

  • Shorter stays (less than 5-6 nights) have varying and discrete prices, indicating price variability.

  • Longer stays show a more consistent pricing structure, forming a nearly horizontal line above zero.

  • Prices tend to decrease as the number of minimum nights increases, indicating a potential correlation between stay duration and overall cost.
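
A trend read off a scatter plot can be backed by a correlation coefficient. The sketch below computes Pearson's r and a rank-based (Spearman) coefficient, obtained as Pearson's r on the ranks, for hypothetical toy values. The numbers are made up and say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical toy rows (assumption: illustrative values only)
toy = pd.DataFrame({
    "minimum_nights": [1, 2, 3, 5, 7, 14, 30],
    "price": [250, 180, 200, 150, 120, 110, 90],
})

# Pearson measures linear association
pearson = toy["minimum_nights"].corr(toy["price"])
# Spearman is Pearson applied to ranks; it is less sensitive to outliers
spearman = toy["minimum_nights"].rank().corr(toy["price"].rank())

print(f"pearson={pearson:.3f}, spearman={spearman:.3f}")
```

A rank-based coefficient is a reasonable choice for price data because a few very expensive listings can dominate Pearson's r.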


Chart - 11

# Chart - 11 visualization code
# top Hotels as per Review
plt.figure(figsize=(10,6))
# Sorting Reviews as per Descending Order
sorted_data=df.sort_values(by='number_of_reviews', ascending = False)[:30]
sns.barplot(x = sorted_data['number_of_reviews'], y = sorted_data['name'],palette='viridis')
plt.xlabel('Count')
plt.ylabel('New York City AirBnb Hotels')
plt.show()

Observations

Overall, the bar plot highlights the top hotels with the most reviews, showcasing a mix of accommodations in various neighbourhoods of New York City. This information can be helpful for travellers looking for highly reviewed options and popular destinations within the city.

Chart - 12

Observations

  • The average price tends to be higher for hosts with a calculated_host_listings_count in the range of 10.0-15.0, at approximately 25.0.

  • Hosts with a calculated_host_listings_count in the range of 0.0-5.0 have an average price of approximately 160. For hosts with a calculated_host_listings_count in the range of 5.0-10.0, the average price is around 100.

  • The calculated_host_listings_count range of 15.0-20.0 has a relatively lower average price, approximately 120.
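
No code is shown for Chart 12 in this notebook. One way an average price per listings-count range could be computed is sketched below with `pd.cut`. This is not the original cell: the bin edges are an assumption chosen to match the ranges named above, and the toy values are hypothetical.

```python
import pandas as pd

# Hypothetical toy rows (assumption: not the real data)
toy = pd.DataFrame({
    "calculated_host_listings_count": [1, 2, 4, 6, 9, 12, 14, 18],
    "price": [150, 170, 160, 100, 110, 240, 260, 120],
})

# Assumed bin edges: (0, 5], (5, 10], (10, 15], (15, 20]
bins = [0, 5, 10, 15, 20]
toy["listing_bin"] = pd.cut(toy["calculated_host_listings_count"], bins=bins)

# Average price within each bin
avg_price = toy.groupby("listing_bin", observed=True)["price"].mean()
print(avg_price)
```

The resulting series could then be passed to a bar plot to obtain a chart of average price per range.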

Chart - 13

# Chart - 13 visualization code
# Longitude VS Latitude VS Price
plt.figure(figsize=(10,6))
# Colour each listing by price; c= is needed for cmap and colorbar to work
plt.scatter(df['longitude'],df['latitude'],c=df['price'], cmap='viridis', alpha=0.6)
plt.colorbar(label='price')
# Set labels
plt.xlabel('Longitude')
plt.ylabel('Latitude')
#display the chart
plt.show()

Observations

  • Dense Concentrations: There is a denser concentration of data points within the longitude range of -74.0 to -73.9. This indicates a higher density of AirBnb listings in this specific region.

  • Rising Trend: As we move from left to right in the chart within the specified region, there is a rising trend. This suggests that properties located towards the eastern part of the region tend to have longer minimum stay requirements.

  • Higher Minimum Nights: Within the concentrated region, the majority of data points exhibit higher minimum stay requirements.

  • Sparse Data and Lower Minimum Nights Outside the Region: Outside the specified latitude and longitude range, there are fewer data points, indicating a lower density of AirBnb listings. Additionally, the minimum nights tend to be lower in these areas compared to the concentrated region.


Chart - 14 - Correlation Heatmap

# Correlation Heatmap visualization code
plt.figure(figsize=(10,6))
# Labeling
sns.heatmap(df[['price','minimum_nights', 'availability_365']].corr(), annot=True)
plt.show()

# Values as per chart
df[['price','minimum_nights','availability_365']].corr().head().style.background_gradient(cmap='Oranges')

Observations

Overall, the heatmap above reveals a mix of weak positive and negative correlations among the variables. It suggests that the number of reviews and reviews per month have the strongest relationship among the variables, while the other variables have weaker or negligible correlations with each other.
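
The heatmap code above covers only `price`, `minimum_nights` and `availability_365`, so comparing the review columns requires a wider correlation matrix. The sketch below builds one on hypothetical toy rows and picks out the strongest off-diagonal pair. The values are made up, so which pair comes out on top here says nothing about the real data.

```python
import pandas as pd

# Hypothetical toy rows (assumption: illustrative values only)
toy = pd.DataFrame({
    "price": [100, 150, 80, 200, 120, 90],
    "minimum_nights": [1, 3, 2, 30, 5, 2],
    "number_of_reviews": [50, 10, 120, 2, 30, 80],
    "reviews_per_month": [2.0, 0.4, 4.5, 0.1, 1.1, 3.0],
    "availability_365": [100, 300, 50, 365, 200, 80],
})

corr = toy.corr()

# Flatten to (column, column) pairs and keep each unordered pair once,
# which also removes the diagonal
pairs = corr.abs().unstack()
first = pairs.index.get_level_values(0)
second = pairs.index.get_level_values(1)
pairs = pairs[first < second]

strongest_pair = pairs.idxmax()
print(strongest_pair, round(pairs.max(), 3))
```

With the notebook's `df`, selecting the numeric columns first, for example `df.select_dtypes('number').corr()`, keeps the text columns out of the calculation.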

Chart - 15 - Pair plot

# Pair Plot Visualization code
sns.pairplot(df)

5. Solution to Business Objective

What do you suggest the client to achieve Business Objective ?

Explain Briefly.

To achieve the business objective of Airbnb booking analysis, the client should adopt a data-driven approach that leverages comprehensive data insights to optimize the booking platform and enhance customer experience. Here's a step-by-step plan:

  1. Data Collection: Gather a wide range of data on Airbnb bookings, including listing details, host information, pricing, guest reviews, and booking patterns. This data can be obtained from internal databases or through web scraping publicly available sources.

  2. Data Preprocessing: Thoroughly clean and preprocess the collected data to remove duplicates, handle missing values, and correct errors. Ensuring data quality is crucial for accurate analysis.

  3. Exploratory Data Analysis (EDA): Conduct an in-depth EDA to gain insights into the data. Identify trends, seasonal patterns, and correlations between different factors that impact bookings. Understanding these patterns can help target specific areas for improvement.

  4. Market Segmentation: Segment the data based on key variables such as location, property type, and price range. Analyzing different market segments allows for tailored strategies to meet the unique demands of each segment.

  5. Demand Forecasting: Implement time series forecasting models to predict future booking demand. This will aid hosts in optimizing pricing and availability, ultimately maximizing occupancy and revenue.

  6. Sentiment Analysis: Perform sentiment analysis on guest reviews to assess customer satisfaction. Understanding guests' feedback will highlight strengths and weaknesses, leading to better service and increased positive reviews.

  7. Price Optimization: Utilize pricing algorithms to optimize listing prices based on demand, competitor pricing, and seasonal variations. Optimized pricing ensures competitiveness and attracts more bookings.

  8. Host Performance Analysis: Evaluate host performance using metrics like occupancy rates, ratings, and guest feedback. Recognize top-performing hosts and incentivize improvements in service for others.

  9. Competitive Analysis: Analyze competitors' offerings and market share to identify opportunities for differentiation. Understanding the competitive landscape can lead to strategies that set Airbnb apart in the market.

  10. Business Recommendations: Based on the insights gained from the analysis, provide actionable recommendations to enhance overall booking performance, customer satisfaction, and revenue generation. These recommendations may include personalized marketing strategies, service enhancements, and targeted promotions.
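
As an illustration of the market segmentation step in the plan above, the sketch below summarises median price by neighbourhood group and room type with a pivot table. The choice of segments and of the median as the statistic are assumptions, and the toy rows are hypothetical.

```python
import pandas as pd

# Hypothetical toy rows (assumption: not the real data)
toy = pd.DataFrame({
    "neighbourhood_group": ["Manhattan", "Manhattan", "Brooklyn", "Brooklyn", "Queens"],
    "room_type": ["Entire home/apt", "Private room", "Entire home/apt",
                  "Private room", "Private room"],
    "price": [220, 110, 160, 70, 60],
})

# One row per neighbourhood group, one column per room type, median price in each cell
segments = toy.pivot_table(index="neighbourhood_group", columns="room_type",
                           values="price", aggfunc="median")
print(segments)
```

Empty cells (NaN) mark segments with no listings, which is itself useful when looking for underrepresented combinations.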

By implementing these steps, Airbnb can optimize its booking platform, attract more guests, retain satisfied hosts, and ultimately achieve its business objective of maximizing bookings and revenue while delivering exceptional customer experiences. Data-driven decisions will be the cornerstone of success in the dynamic and competitive vacation rental market.

Conclusion

The Airbnb Booking Analysis provides valuable data-driven insights to optimize the platform and enhance customer experience. Through comprehensive data collection, exploratory analysis, and demand forecasting, Airbnb can implement pricing algorithms, recognize top-performing hosts, and identify opportunities for differentiation. Sentiment analysis of guest reviews enables improvements in service, ultimately maximizing bookings and revenue. The competitive analysis helps Airbnb stay ahead in the market. By acting on the recommended strategies, Airbnb can attract more guests, retain satisfied hosts, and achieve its business objective of sustained growth and success in the vacation rental industry.
